This report explores a dataset containing chemical information and the quality score of different labels of white wine.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The dataset consists of 4,898 observations of 13 variables, one of which (X) is a row index.
The quality scores appear roughly normally distributed. Very few wines are rated 3 (very poor quality) or 9 (excellent quality). The mean (red line), median (blue line), and mode of the quality ratings all fall near a score of 6. Given the information in the dataset, I wonder which factors can effectively predict the quality score of white wine.
Most of the independent variables appear roughly normally distributed, except for residual sugar, which has a long right tail. We’ll apply a log transformation to get a better view of its distribution. The two SO2 variables are also of interest: we can analyze the proportion of free SO2 in later analysis.
whiteWine$prop_free.sulfur.dioxide <- whiteWine$free.sulfur.dioxide / whiteWine$total.sulfur.dioxide
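As a quick sanity check on the derived variable, the sketch below applies the same ratio to the first ten rows shown in the str() output above. The `free` and `total` vectors are copied from that output; the vector names are placeholders and not part of the original analysis.

```r
# First ten values of free and total sulfur dioxide, copied from the str() output.
free  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Proportion of total SO2 that is free. Free SO2 is one component of
# total SO2, so every value should fall between 0 and 1.
prop_free <- free / total
round(prop_free, 3)
range(prop_free)
```

Values outside the 0-1 range in the full dataset would point to a data-entry problem rather than a real wine.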
Creating a plotting function to simplify the code:
library(ggplot2)

plot_histogram <- function(variable, binwidth = 0.01) {
  # `variable` is a column vector such as whiteWine$alcohol
  ggplot(whiteWine, aes(x = variable)) +
    geom_histogram(binwidth = binwidth)
}
The log-transformed residual.sugar distribution appears bimodal, with peaks at around 1.1-1.6 and 8.0 g/dm^3.
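The effect of the transformation can be sketched on the first ten residual.sugar values from the str() output. This is only an illustration of how log10 compresses the long tail and how to read positions back in the original units; the vector names are placeholders.

```r
# First ten residual.sugar values (g/dm^3), copied from the str() output.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the right tail: a tenfold difference becomes one unit.
log_sugar <- log10(residual_sugar)
round(log_sugar, 2)

# Back-transform to express a position on the log axis in g/dm^3.
back <- 10^log_sugar
back
```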
Chloride levels don’t seem to differ much across the wines in the dataset.
The proportion of free sulfur dioxide is approximately normally distributed, with a peak at around 25%-28%.
There are 4,898 different labels of white wine with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). The 11 chemical variables are continuous; quality is an integer score.
The quality rating is on a scale of 0 (very bad) to 10 (very good). Wines in the current dataset cover only ratings of 3-9.
Other observations:
* Most white wines in the dataset have very little residual sugar (around 1 g per cubic decimeter).
* Most wines contain similar amounts of salt (sodium chloride), with the distribution peaking at 0.03-0.06 g per cubic decimeter.
The main feature of interest is the quality ratings. We look to investigate which chemicals influence the quality rating of white wines.
Both residual.sugar and alcohol levels have interesting distributions. There may also be interrelationships between some of the variables.
Yes, I divided free.sulfur.dioxide by total.sulfur.dioxide to calculate the proportion of free sulfur dioxide (prop_free.sulfur.dioxide).
As the quality ratings are integers from 3 to 9, I converted quality into an ordered factor.
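A minimal sketch of that conversion, assuming the levels are fixed to 3 through 9. The `scores` vector is made up for illustration, and the exact call used in the original analysis is not shown in the report.

```r
# Hypothetical scores spanning the 3-9 range seen in the dataset.
scores <- c(6, 5, 7, 3, 9, 6)

# Fix the levels explicitly so every rating from 3 to 9 is represented,
# even when a rating is absent from a subset of the data.
quality <- factor(scores, levels = 3:9, ordered = TRUE)

levels(quality)
quality[1] > quality[2]   # ordered factors support comparisons

# Convert back to integers by going through character first;
# as.integer() alone would return the level positions 1-7 instead.
as.integer(as.character(quality))
```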
Residual sugar had a long-tailed distribution, so a log transformation was applied.
There may be outliers in the dataset, but I kept them for further studies of the best and worst wines.
The four highest correlation coefficients of variables with quality:
quality~alcohol:
## [1] 0.4355747
quality~chlorides:
## [1] -0.2099344
quality~density:
## [1] -0.3071233
quality~prop_free.sulfur.dioxide:
## [1] 0.1972141
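One way to pick out the strongest correlations is to rank the predictors by the absolute value of their Pearson coefficient. The sketch below is a hypothetical helper (`rank_correlations` is not from the original analysis) and runs on made-up toy data rather than on whiteWine.

```r
# Rank numeric columns by the absolute value of their Pearson correlation
# with a target column. `rank_correlations` is a hypothetical helper.
rank_correlations <- function(df, target) {
  predictors <- setdiff(names(df), target)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up toy data, for illustration only.
toy <- data.frame(
  y = 1:6,
  a = c(2, 4, 6, 8, 10, 12),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(1, -1, 1, -1, 1, -1)
)
ranked <- rank_correlations(toy, "y")
round(ranked, 3)
```

Ranking by absolute value keeps strong negative correlations, such as quality ~ density, near the top of the list.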
Some observations:
* Correlation coefficients between quality and the remaining variables are not displayed. However, boxplots of quality ~ alcohol and quality ~ density show interesting patterns that are worth further investigation.
* As the selected correlation coefficients show, the quality of wine cannot be sufficiently predicted by any single chemical property.
* Density and residual sugar are strongly positively correlated (r = 0.839).
* Chlorides and alcohol are moderately negatively correlated (r = -0.36).
* Density and alcohol are strongly negatively correlated (r = -0.78).
We’ll take a further look into these variables.
Because the quality rating has many levels, the plots get messy. We create a quality.bucket variable that groups the ratings into three buckets, each marked with a different color, to help visualize the differences among these groups.
whiteWine$quality.int <- as.integer(as.character(whiteWine$quality))
whiteWine$quality.bucket <- cut(whiteWine$quality.int, c(2, 5, 7, 10), ordered_result = TRUE)
table(whiteWine$quality.bucket)
##
## (2,5] (5,7] (7,10]
## 1640 3078 180
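To make the bucket boundaries explicit, this small sketch applies the same breaks to each rating that occurs in the dataset (3 through 9). With the default right-closed intervals, a score of 5 falls in (2,5] and a score of 7 falls in (5,7].

```r
# Each rating present in the dataset, mapped to its bucket.
scores  <- 3:9
buckets <- cut(scores, breaks = c(2, 5, 7, 10), ordered_result = TRUE)

data.frame(score = scores, bucket = buckets)
```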
Creating a plotting function to simplify the code:
plot_boxplot <- function(variable) {
  # `variable` is a column vector such as whiteWine$alcohol
  ggplot(whiteWine, aes(factor(quality), variable)) +
    geom_boxplot(aes(fill = factor(quality.bucket)),
                 alpha = 0.4) +
    guides(fill = "none") +  # hide the legend
    coord_flip()
}
We can see from the stacked histogram above that low-quality wines gather at the left side of the graph, while higher-quality wines gather at the right side.
The relationship is not easy to see as there are too many levels of quality. Next we’ll use the quality buckets to create a better color visualization.
Conditional means/medians of alcohol content among three quality groups:
## # A tibble: 3 × 4
## quality.bucket alcohol_mean alcohol_median n
## <ord> <dbl> <dbl> <int>
## 1 (2,5] 9.84953 9.6 1640
## 2 (5,7] 10.80197 10.8 3078
## 3 (7,10] 11.65111 12.0 180
The table shows that high-quality wines tend to have higher alcohol content and poorer-quality wines tend to have lower alcohol content.
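The code behind these conditional summaries is not shown in the report; the tibble output suggests a grouped summary. A dependency-free sketch of the same idea in base R follows. The toy data and the `summarise_by_bucket` helper are hypothetical.

```r
# Made-up toy data standing in for whiteWine; values are illustrative only.
toy <- data.frame(
  bucket  = factor(c("low", "low", "mid", "mid", "mid", "high"),
                   levels = c("low", "mid", "high"), ordered = TRUE),
  alcohol = c(9.0, 10.0, 10.5, 11.0, 11.5, 12.5)
)

# One row per group with the mean, median, and group size.
# tapply() and table() both return results in factor-level order.
summarise_by_bucket <- function(df, value, group) {
  data.frame(
    group  = levels(df[[group]]),
    mean   = as.vector(tapply(df[[value]], df[[group]], mean)),
    median = as.vector(tapply(df[[value]], df[[group]], median)),
    n      = as.vector(table(df[[group]]))
  )
}

result <- summarise_by_bucket(toy, "alcohol", "bucket")
result
```

On the real data the call would presumably look like `summarise_by_bucket(whiteWine, "alcohol", "quality.bucket")`.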
Conditional means/medians of chlorides among three quality groups:
## # A tibble: 3 × 4
## quality.bucket chlorides_mean chlorides_median n
## <ord> <dbl> <dbl> <int>
## 1 (2,5] 0.05143598 0.0470 1640
## 2 (5,7] 0.04320858 0.0410 3078
## 3 (7,10] 0.03801111 0.0355 180
Even after adjusting for the long tail, the difference among the three quality groups is not easy to see. Although the difference in chloride level is small, higher chloride content tends to occur among lower-quality wines.
Conditional means/medians of density among three quality groups:
## # A tibble: 3 × 4
## quality.bucket density_mean density_median n
## <ord> <dbl> <dbl> <int>
## 1 (2,5] 0.9951600 0.99514 1640
## 2 (5,7] 0.9935299 0.99305 3078
## 3 (7,10] 0.9922144 0.99162 180
Although the conditional mean/median comparison indicates that the difference among the groups is small, the density plot shows that, within a narrow range of density levels, poorer-quality wines tend to have higher density and better-quality wines tend to have lower density.
Conditional means/medians comparison among three quality groups:
## # A tibble: 3 × 4
## quality.bucket SO2_mean SO2_median n
## <ord> <dbl> <dbl> <int>
## 1 (2,5] 0.2322617 0.2310096 1640
## 2 (5,7] 0.2660284 0.2626691 3078
## 3 (7,10] 0.2892855 0.2876712 180
The density plot suggests that free SO2 proportion levels do not differ very much across the three quality buckets. The correlations we found from the scatterplot matrix may partly reflect covariance with other variables. We’ll plot some of these variables together to further see their interrelationships.
Quality correlates most strongly with alcohol content (r = 0.44).
The variance in alcohol levels peaks among the lower and the higher quality wines. High-quality wines tend to have higher alcohol content (11-13%) and low-quality wines tend to have lower alcohol content (8.5-10%). Medium-quality wines typically spread out nicely across alcohol content around 9-13%.
Density level also seems to have a high influence on the quality ratings of wine. The interactions between some of these factors may be important to look further into when we try to predict the quality ratings of wines.
It seems that the density of wine is correlated with many other features. It is highly correlated with residual sugar: higher residual sugar content is associated with higher density. Density is also strongly negatively correlated with alcohol content: lower alcohol levels are associated with higher density.
Other interesting correlations are found between chlorides and alcohol, and between residual sugar and alcohol. The correlation coefficients are around 0.35-0.4 in magnitude, but it’s hard to see the relation from the scatter plots. We’ll plot them together to see further interactions.
The strongest correlation with quality rating was alcohol level. The relationship is even clearer when we group wines into quality buckets.
As we’ve found other variables that could potentially share covariance with alcohol levels, we’ll investigate further on chlorides, residual sugar, and density levels.
Density and residual sugar have a strong correlation of 0.839.
After adjusting the scales and eliminating the outlier, we can see almost three distinct upward lines for the different quality buckets. This plot suggests that density increases as residual sugar content increases. Poor-quality wines tend to have higher density. Medium- and high-quality wines, on the other hand, tend to have lower density.
Density and alcohol have a strong correlation of -0.78.
From the scatterplot above we can see that higher alcohol content is associated with lower density. Medium- and high-quality wines tend to have higher alcohol content and lower density, while lower-quality wines tend to have higher density and lower alcohol content.
Alcohol and chlorides have a moderate correlation of -0.36.
Apart from the difference in alcohol content, some of the low-quality wines tend to have higher chloride levels. However, the difference in chlorides seems to be minimal.
Apart from the relationship between alcohol and the quality buckets, the difference in residual sugar seems to be minimal.
library(memisc)  # provides mtable()

m1 <- lm(quality.int ~ alcohol, data = whiteWine)
m2 <- update(m1, ~. + density)
m3 <- update(m2, ~. + residual.sugar)
m4 <- update(m3, ~. + chlorides)
m5 <- update(m4, ~. + prop_free.sulfur.dioxide)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
summary(m5)
##
## Calls:
## m1: lm(formula = quality.int ~ alcohol, data = whiteWine)
## m2: lm(formula = quality.int ~ alcohol + density, data = whiteWine)
## m3: lm(formula = quality.int ~ alcohol + density + residual.sugar,
## data = whiteWine)
## m4: lm(formula = quality.int ~ alcohol + density + residual.sugar +
## chlorides, data = whiteWine)
## m5: lm(formula = quality.int ~ alcohol + density + residual.sugar +
## chlorides + prop_free.sulfur.dioxide, data = whiteWine)
##
## =======================================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** 90.313*** 87.563*** 56.573***
## (0.098) (6.165) (12.374) (12.392) (12.518)
## alcohol 0.313*** 0.360*** 0.246*** 0.237*** 0.262***
## (0.009) (0.015) (0.018) (0.018) (0.018)
## density 24.728*** -87.886*** -84.931*** -54.289***
## (6.079) (12.317) (12.340) (12.461)
## residual.sugar 0.053*** 0.052*** 0.038***
## (0.005) (0.005) (0.005)
## chlorides -1.776** -1.861***
## (0.555) (0.548)
## prop_free.sulfur.dioxide 1.404***
## (0.122)
## ---------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.210 0.212 0.233
## adj. R-squared 0.190 0.192 0.210 0.211 0.232
## sigma 0.797 0.796 0.787 0.787 0.776
## F 1146.395 583.290 434.085 328.736 296.836
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5776.812 -5771.696 -5705.710
## Deviance 3112.257 3101.773 3033.737 3027.406 2946.925
## AIC 11684.782 11670.255 11563.624 11555.391 11425.420
## BIC 11704.272 11696.241 11596.107 11594.371 11470.896
## N 4898 4898 4898 4898 4898
## =======================================================================================
##
## Call:
## lm(formula = quality.int ~ alcohol + density + residual.sugar +
## chlorides + prop_free.sulfur.dioxide, data = whiteWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5838 -0.5291 -0.0377 0.4774 3.1759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.572824 12.518291 4.519 6.35e-06 ***
## alcohol 0.261991 0.018333 14.290 < 2e-16 ***
## density -54.289385 12.461186 -4.357 1.35e-05 ***
## residual.sugar 0.037838 0.005185 7.298 3.40e-13 ***
## chlorides -1.861363 0.547903 -3.397 0.000686 ***
## prop_free.sulfur.dioxide 1.404420 0.121504 11.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7761 on 4892 degrees of freedom
## Multiple R-squared: 0.2328, Adjusted R-squared: 0.232
## F-statistic: 296.8 on 5 and 4892 DF, p-value: < 2.2e-16
Among the models above, model 5 explains the most variance (adj. R^2 = 0.232) and has the lowest AIC and BIC. We’ll take a look at the distribution of its residuals.
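The AIC and BIC rows of the table can be reproduced from the reported log-likelihoods, which is a useful check on how the criteria penalize extra terms. The sketch below uses only numbers copied from the table and assumes the usual parameter count for a linear model: the regression coefficients plus the residual variance.

```r
# Values copied from the model comparison table.
n       <- 4898
log_lik <- c(m1 = -5839.391, m5 = -5705.710)

# Parameter count: coefficients (including the intercept) plus one
# for the residual variance.
k <- c(m1 = 2 + 1, m5 = 6 + 1)

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k
round(rbind(aic, bic), 3)
```

BIC charges log(4898), roughly 8.5, per parameter compared with 2 for AIC, so model 5 having the lowest BIC means its gain in fit outweighs the heavier penalty for four extra terms.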
We can see that the majority of residuals occur in the 5 and 6 categories, where the errors mostly fall between -1 and 1. An error of 1 on this scale is understandable, so the model does a reasonable job of predicting the quality score despite its modest R^2.
Alcohol concentration seems to be the deciding factor for evaluating wine quality. Although other features also seem to influence the wine, their influence on wine quality is rather indirect, as they correlate more strongly with alcohol level instead of with quality ratings directly.
It’s interesting to see that the density of wine decreases as the alcohol content increases. More surprisingly, we found that the level of residual sugar also decreases as the alcohol increases. It would be fascinating to learn more about the chemical and biological reactions that take place during wine production.
Yes, I created a linear model, starting from alcohol content and density and then adding further variables.
The variables in the linear model account for only 23.2% of the variance in wine quality. Adding residual sugar, chlorides, and free SO2 proportion improved R^2 by about 4 percentage points (from 0.192 to 0.233), which is expected based on the correlations found between features. Also, as taking log10 of residual sugar did not improve the goodness of fit, the feature was included in the model in its original form.
There is a moderate positive correlation (r = 0.44) between quality rating and alcohol level. High-quality wines tend to have higher alcohol content and poorer-quality wines tend to have lower alcohol content.
From the plot we can see that alcohol and density of wine are negatively correlated. Medium- and high-quality wines tend to have higher alcohol content and lower density, while lower-quality wines tend to have higher density and lower alcohol content.
We chose to fit the model: quality = 56.57 + 0.26(alcohol) - 54.29(density) + 0.04(residual.sugar) - 1.86(chlorides) + 1.40(free.SO2 / total.SO2). The residuals are plotted above. As the majority of the error occurs at ratings of 5 and 6 and falls within the range of -1 to 1, we can say that the model does a decent job describing the current dataset.
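As a worked example of the fitted equation, the sketch below applies the coefficients from summary(m5) to the first wine listed in the str() output. The `predict_quality` helper is hypothetical, and because str() prints density rounded to three decimals, the result is only approximate.

```r
# Coefficients copied from summary(m5).
coefs <- c(
  intercept      = 56.572824,
  alcohol        = 0.261991,
  density        = -54.289385,
  residual.sugar = 0.037838,
  chlorides      = -1.861363,
  prop_free      = 1.404420
)

# Hypothetical helper: apply the fitted equation to one wine.
predict_quality <- function(alcohol, density, residual.sugar,
                            chlorides, free.so2, total.so2) {
  unname(
    coefs["intercept"] +
      coefs["alcohol"] * alcohol +
      coefs["density"] * density +
      coefs["residual.sugar"] * residual.sugar +
      coefs["chlorides"] * chlorides +
      coefs["prop_free"] * (free.so2 / total.so2)
  )
}

# First row of the dataset as printed by str(); observed quality is 6.
pred <- predict_quality(alcohol = 8.8, density = 1.001,
                        residual.sugar = 20.7, chlorides = 0.045,
                        free.so2 = 45, total.so2 = 170)
pred
6 - pred   # residual for this wine
```

Because the density coefficient is large, a rounding error of 0.0005 in density shifts the prediction by about 0.03, which is why the printed values should not be used to reproduce the residuals exactly.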
The white wine dataset contains information about 4,898 labels of wine. I started by understanding the individual variables, then explored the correlations between pairs of features and made some interesting observations.
There was a trend between the alcohol concentration of wine and its quality. The trend is clearer when I regrouped the wines into three buckets by their quality score. Having quality rating with three levels made it easier to visualize the correlation with other features of wine.
With all the information I’ve found, I was able to create a linear model that captures the relationships between different features of wines to predict white wine quality. Although it captures only 23% of the variance in the dataset, the error is within an acceptable range. However, as the model includes only linear terms, further adjustments and improvements can be made by exploring other possibilities.